We will use the following packages for this example:
if (F) {
install.packages(c("akima", "caret", "devtools", "earth", "gam", "ggplot2", "mgcv", "mlbench", "plotmo")) # run these install commands manually if needed
devtools::install_github("ck37/ck37r")
}
library(akima)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ck37r)
library(devtools)
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.14
library(ggplot2)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
##
## Attaching package: 'mgcv'
## The following objects are masked from 'package:gam':
##
## gam, gam.control, gam.fit, plot.gam, predict.gam, s,
## summary.gam
library(mlbench)
library(plotmo)
## Loading required package: plotrix
## Loading required package: TeachingDemos
library(earth)
Use the “PimaIndiansDiabetes2” dataset to construct a generalized additive model (GAM) and a multivariate adaptive regression splines (MARS, aka “earth”) model. Blood pressure will be the response variable. Missing data will be median-imputed, and indicator variables will be created to document their missingness.
# load the dataset
data(PimaIndiansDiabetes2)
?PimaIndiansDiabetes2
data <- PimaIndiansDiabetes2 # give the data a simpler name
str(data)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 NA 70 96 ...
## $ triceps : num 35 29 NA 23 35 NA 32 NA 45 NA ...
## $ insulin : num NA NA NA 94 168 NA 88 NA 543 NA ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
Check for missing data:
# check for missing cases
sum(is.na(data))
## [1] 652
# how much of the data is missing?
sum(is.na(data)) / (nrow(data)*ncol(data)) # about 9%
## [1] 0.0943287
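To see where that missingness concentrates, a per-column NA count is handy. A small base-R sketch (the helper name `na_by_col` is just illustrative):

```r
# count NA values per column; apply to the Pima data as na_by_col(data)
na_by_col <- function(df) colSums(is.na(df))

# illustrated on a tiny stand-in frame with the same kind of gaps
toy <- data.frame(glucose = c(148, NA, 183), insulin = c(NA, NA, 94))
na_by_col(toy)
```

On the Pima data this shows that insulin and triceps account for the bulk of the 652 missing values.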
Recode the “diabetes” vector to numeric type:
data$diabetes <- ifelse(data$diabetes=="pos", 1, 0)
Use Chris K’s handy median impute function to impute missing values:
# impute and add missingness indicators
result <- ck37r::impute_missing_values(data)
# overwrite "data" with new imputed data frame
data <- result$data
Double check that missing values have been imputed:
# no more NA values
sum(is.na(data))
## [1] 0
# check that missingness indicators have been added
str(data)
## 'data.frame': 768 obs. of 14 variables:
## $ pregnant : num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure : num 72 66 64 66 40 74 50 72 70 96 ...
## $ triceps : num 35 29 29 23 35 29 32 29 45 29 ...
## $ insulin : num 125 125 125 94 168 125 88 125 543 125 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 32.3 ...
## $ pedigree : num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes : num 1 0 1 0 1 0 1 0 1 1 ...
## $ miss_glucose : num 0 0 0 0 0 0 0 0 0 0 ...
## $ miss_pressure: num 0 0 0 0 0 0 0 1 0 0 ...
## $ miss_triceps : num 0 0 1 0 0 1 0 1 0 1 ...
## $ miss_insulin : num 1 1 1 0 0 1 0 1 0 1 ...
## $ miss_mass : num 0 0 0 0 0 0 0 0 0 1 ...
This semester, MLWG has explored linear, polynomial, and spline regression models using single predictors (March 3) as well as stepwise selection using multiple predictors (Feb 17). Deb also offered an informative take on splines earlier today (Mar 17). Last semester, we talked about improving linear regression models via penalized regression (LASSO and ridge) using multiple predictors (Nov 4).
When considering multiple predictor variables, another extension of multiple linear regression is available: the generalized additive model (GAM). GAMs are not bound to linear relationships between the predictors and the response; instead, each predictor can enter through a smoothed, nonlinear function. Each relationship is estimated separately and the results are summed (thus making the model “additive”). Smoothing splines are not the only constructs used to build GAMs, either: they can also be built from natural splines, local regression, polynomial regression, etc.
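Concretely, the model fit below (Gaussian family, identity link) has the additive form

\[ \mathbb{E}(\text{pressure}) = \beta_0 + f_1(\text{glucose}) + f_2(\text{insulin}) + f_3(\text{age}) + \beta_1 \cdot \text{diabetes}, \]

where each \(f_j\) is a smooth function estimated from the data rather than a single fixed coefficient.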
“Backfitting” is one way the smooths are estimated: the model is updated one predictor at a time, re-fitting each smooth term (via penalized likelihood maximization) against the partial residuals while the other terms are held fixed, cycling until convergence.
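The backfitting idea can be sketched in a few lines of base R, here using `loess` smoothers in place of mgcv's penalized splines (a toy illustration of the algorithm only, not how mgcv actually fits):

```r
# toy backfitting: alternately smooth each predictor against partial residuals
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + 2 * x2^2 + rnorm(n, sd = 0.2)

f1 <- rep(0, n); f2 <- rep(0, n)
alpha <- mean(y)
for (iter in 1:20) {
  # update f1 against partial residuals, holding f2 fixed
  r1 <- y - alpha - f2
  f1 <- fitted(loess(r1 ~ x1))
  f1 <- f1 - mean(f1)              # center for identifiability
  # update f2 against partial residuals, holding f1 fixed
  r2 <- y - alpha - f1
  f2 <- fitted(loess(r2 ~ x2))
  f2 <- f2 - mean(f2)
}
# fitted values of the additive model: intercept plus the two smooths
yhat <- alpha + f1 + f2
```

After a handful of sweeps the two smooths stabilize and `yhat` tracks the true additive signal closely.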
See Wood’s book for thorough walkthroughs of GAMs in R.
As always, we also encourage Introduction to Statistical Learning - Chapter 7 for a nice introductory overview and exercises.
See Faraway 2002 for a great intro to regression and ANOVA.
Fit the GAM:
gam1 <- gam(pressure ~ s(glucose) + s(insulin) + s(age) + diabetes,
family="gaussian",
method="GCV.Cp",
data=data)
gam1
##
## Family: gaussian
## Link function: identity
##
## Formula:
## pressure ~ s(glucose) + s(insulin) + s(age) + diabetes
##
## Estimated degrees of freedom:
## 2.31 1.00 2.41 total = 7.73
##
## GCV score: 127.3935
# view summary output
gam.check(gam1)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 12 iterations.
## The RMS GCV score gradient at convergence was 1.109041e-05 .
## The Hessian was positive definite.
## Model rank = 29 / 29
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(glucose) 9.000 2.313 1.036 0.83
## s(insulin) 9.000 1.000 0.993 0.44
## s(age) 9.000 2.413 0.992 0.42
names(gam1)
## [1] "coefficients" "residuals" "fitted.values"
## [4] "family" "linear.predictors" "deviance"
## [7] "null.deviance" "iter" "weights"
## [10] "prior.weights" "df.null" "y"
## [13] "converged" "sig2" "edf"
## [16] "edf1" "hat" "R"
## [19] "boundary" "sp" "nsdf"
## [22] "Ve" "Vp" "rV"
## [25] "mgcv.conv" "gcv.ubre" "aic"
## [28] "rank" "gcv.ubre.dev" "scale.estimated"
## [31] "method" "smooth" "formula"
## [34] "var.summary" "cmX" "model"
## [37] "control" "terms" "pred.formula"
## [40] "pterms" "assign" "xlevels"
## [43] "offset" "df.residual" "min.edf"
## [46] "optimizer" "call"
gam1$aic
## [1] 5904.123
gam1$sig2
## [1] 126.1119
Play with some basic plotting features:
plot(gam1, se=T,
shade=T, col="black", shade.col="gray80",
residuals=F,
pages=1)
title("gam1")
Our plots suggest that the effect of “glucose” is fairly linear. What if we compare gam1 to two other GAMs: one that excludes the predictor glucose, and another that assumes a linear effect for glucose?
# model that excludes glucose
gam2 <- gam(pressure ~ s(insulin) + s(age) + diabetes,
family="gaussian",
method="GCV.Cp",
data=data)
plot(gam2, pages=1)
# model that assumes linear glucose
gam3 <- gam(pressure ~ glucose + s(insulin) + s(age) + diabetes,
family="gaussian",
method="GCV.Cp",
data=data)
plot(gam3, pages=1)
anova(gam1, gam2, gam3, test="F") # small p-value suggests that a non-linear function for glucose is preferable?
## Analysis of Deviance Table
##
## Model 1: pressure ~ s(glucose) + s(insulin) + s(age) + diabetes
## Model 2: pressure ~ s(insulin) + s(age) + diabetes
## Model 3: pressure ~ glucose + s(insulin) + s(age) + diabetes
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 759.04 95880
## 2 761.57 98098 -2.52839 -2218.4 6.9572 0.0003364 ***
## 3 760.94 96453 0.62785 1645.4 20.7813 0.0001390 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(gam1, gam2, gam3) # is this a multiple comparison problem?
## df AIC
## gam1 8.725948 5904.123
## gam2 6.594223 5917.426
## gam3 7.430271 5906.107
BIC(gam1, gam2, gam3)
## df BIC
## gam1 8.725948 5944.644
## gam2 6.594223 5948.048
## gam3 7.430271 5940.612
What if we refit the model on a subset of the cases, say only the diabetic patients? (Note that gam4 below is then fit to fewer observations, so its AIC is not directly comparable to models fit on the full data.)
table(data$diabetes, I(data$pregnant>14))
##
## FALSE TRUE
## 0 500 0
## 1 266 2
gam4 <- gam(pressure ~ s(glucose) + s(insulin) + s(age) + diabetes,
family="gaussian",
data=data,
subset=(diabetes !=0))
plot(gam4, se=TRUE, seWithMean=TRUE,
shade=TRUE, col="blue", shade.col="lightgreen",
residuals=FALSE,
pages=1)
title("GAM - adjusted predictors")
AIC(gam1, gam2, gam3, gam4)
## df AIC
## gam1 8.725948 5904.123
## gam2 6.594223 5917.426
## gam3 7.430271 5906.107
## gam4 5.777994 2067.410
The “plotmo” R package offers a great way to visualize regression splines in three dimensions:
plotmo(gam1, all2=TRUE) # show simplified seWithMean plots AND three-dimensional splines for all variable relationships
## plotmo grid: diabetes glucose insulin age
## 0 117 125 29
# non-additive shapes have correlated effects in 3D plane surfaces.
# plot partial dependencies (takes a few minutes)
# plotmo(gam1, all2=TRUE, pmethod = "partdep")
# faster version of pmethod="partdep"
plotmo(gam1, all2=TRUE, pmethod = "apartdep",
caption = "What have I gotten myself in to...")
## calculating apartdep for diabetes
## calculating apartdep for glucose
## calculating apartdep for insulin
## calculating apartdep for age
## calculating apartdep for diabetes:glucose 0123456790
## calculating apartdep for diabetes:insulin 0123456790
## calculating apartdep for diabetes:age 0123456790
## calculating apartdep for glucose:insulin 01234567890
## calculating apartdep for glucose:age 01234567890
## calculating apartdep for insulin:age 01234567890
# let's play around with a few more parameters!
plotmo(gam1, all2=TRUE, pt.col = "green3")
## plotmo grid: diabetes glucose insulin age
## 0 117 125 29
plotmo(gam1, all2=TRUE, pt.col = "green3", smooth.col = "red")
## plotmo grid: diabetes glucose insulin age
## 0 117 125 29
plotmo(gam1, all2=TRUE,
pt.col = "green3",
smooth.col = "red",
grid.col="gray80")
## plotmo grid: diabetes glucose insulin age
## 0 117 125 29
# return just some of the plots!
plotmo(gam1, all2=TRUE, degree1 = c(1,2), degree2=0, col="tomato") # show just the first two predictor plots
## plotmo grid: diabetes glucose insulin age
## 0 117 125 29
plotmo(gam1, all2=TRUE, degree1 = 0, degree2 = 1, # return just the first two-predictor perspective plot
caption = "this is called a 'perspective plot'",
persp.col="orange")
See Wood S. 2006. Generalized additive models: An introduction with R for expert explanations.
Also check out Stephen Milborrow’s excellent instructions on the “plotmo” R package
Multivariate adaptive regression splines (MARS) is a technique developed by Jerome H. Friedman in 1991, and the name “MARS” is trademarked by Salford Systems. Open-source implementations are therefore referred to as “earth”, but may not be identical to MARS. Also see the “mda” R package and Friedman’s papers for specifics.
earth = Enhanced Adaptive Regression Through Hinges
These approaches use “surrogate features” (models of the models), usually built from one or two predictors at a time. Each predictor’s range is split in two at a cut point, and the outcome is modeled separately on each side of the cut. This creates a “piecewise linear model” in which each new feature covers some portion of the data.
The cut points are chosen via linear regression models: the candidate split with the smallest error is kept. See Kuhn and Johnson, 2016:145 for more information.
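The “hinges” in the earth backronym are exactly these cut-point features: for a knot \(c\), the pair of hinge functions is \(h(x-c)=\max(0,\,x-c)\) and \(h(c-x)=\max(0,\,c-x)\). A minimal base-R sketch:

```r
# hinge function: zero on one side of the knot, linear on the other
h <- function(z) pmax(0, z)

age <- c(21, 29, 45, 51, 60)
h(age - 51)  # kicks in only once age exceeds 51
h(51 - age)  # kicks in only while age is below 51
```

The fitted model below uses exactly these shapes, e.g. the terms h(mass-24.2) and h(51-age) in the summary output.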
Fit the earth model:
# fit the model
set.seed(1)
earth1 <- earth(pressure ~ ., data=data,
degree=1, nk=5,
keepxy=TRUE, nprune=20, nfold=10, ncross=2,
pmethod="cv", trace=4)
## Call: earth(formula=pressure~., data=data, pmethod="cv", keepxy=TRUE,
## trace=4, degree=1, nprune=20, nfold=10, ncross=2, nk=5)
## === pmethod="cv": Preliminary model with pmethod="backward" ===
##
## x[768,13]:
## pregnant glucose triceps insulin mass pedigree age diabetes
## 1 6 148 35 125 33.6 0.627 50 1
## 2 1 85 29 125 26.6 0.351 31 0
## 3 8 183 29 125 23.3 0.672 32 1
## ... 1 89 23 94 28.1 0.167 21 0
## 768 1 93 31 125 30.4 0.315 23 0
## miss_glucose ...
## 1 0 ...
## 2 0 ...
## 3 0 ...
## ... 0 ...
## 768 0 ...
##
## y[768,1]:
## pressure
## 1 72
## 2 66
## 3 64
## ... 66
## 768 70
##
## Forward pass: minspan 7 endspan 11 x[768,13] 78 kB bx[768,5] 30 kB
##
## GRSq RSq DeltaRSq Pred PredName Cut Terms Par Deg
## 1 0.0000 0.0000 (Intercept)
## 2 0.1136 0.1229 0.1229 7 age 51 2 3 1
## 4 0.1785 0.1955 0.07266 5 mass 24.2 4 5 1 final (reached nk 5)
##
## Reached nk 5
## After forward pass GRSq 0.178 RSq 0.196
## Forward pass complete: 5 terms
##
## nprune=20
## Subset size GRSq RSq DeltaGRSq nPreds Terms (col nbr in bx)
## 1 0.0000 0.0000 0.0000 0 1
## 2 0.1153 0.1199 0.1153 1 1 3
## chosen 3 0.1836 0.1921 0.0683 2 1 3 4
## 4 0.1824 0.1952 -0.0012 2 1 3 4 5
## 5 0.1785 0.1955 -0.0039 2 1 2 3 4 5
##
## Prune method "backward" penalty 2 nprune 20: selected 3 of 5 terms, and 2 of 13 preds
## After pruning pass GRSq 0.184 RSq 0.192
## Full model GRSq 0.184 RSq 0.192, starting cross validation
## CV fold 1.1 CVRSq 0.223 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 1.2 CVRSq 0.219 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 1.3 CVRSq 0.203 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 1.4 CVRSq 0.244 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 1.5 CVRSq 0.118 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 1.6 CVRSq 0.034 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 1.7 CVRSq 0.110 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 1.8 CVRSq 0.182 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 1.9 CVRSq 0.129 n.oof 692 10% n.infold.nz 692 100% n.oof.nz 76 100%
## CV fold 1.10 CVRSq 0.156 n.oof 692 10% n.infold.nz 692 100% n.oof.nz 76 100%
## CV fold 2.1 CVRSq 0.092 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 2.2 CVRSq 0.067 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 2.3 CVRSq 0.226 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 2.4 CVRSq 0.020 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 2.5 CVRSq 0.175 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 2.6 CVRSq 0.287 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 2.7 CVRSq 0.101 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 2.8 CVRSq 0.300 n.oof 691 10% n.infold.nz 691 100% n.oof.nz 77 100%
## CV fold 2.9 CVRSq 0.038 n.oof 692 10% n.infold.nz 692 100% n.oof.nz 76 100%
## CV fold 2.10 CVRSq 0.255 n.oof 692 10% n.infold.nz 692 100% n.oof.nz 76 100%
## CV all CVRSq 0.159 n.infold.nz 768 100%
##
## === pmethod="cv": Calling update.earth internally for nterms.selected.by.cv=3 ===
##
## update.earth: using 768 by 14 data saved by keepxy in original call to earth
## Call: earth(formula=pressure~., data=data.frame[768,14], pmethod="cv",
## keepxy=TRUE, trace=trace, glm=glm, degree=1,
## nprune=nterms.selected.by.cv, nfold=0, ncross=1,
## varmod.method="none", nk=5, Object=rv)
##
## x[768,13]:
## pregnant glucose triceps insulin mass pedigree age diabetes
## 1 6 148 35 125 33.6 0.627 50 1
## 2 1 85 29 125 26.6 0.351 31 0
## 3 8 183 29 125 23.3 0.672 32 1
## ... 1 89 23 94 28.1 0.167 21 0
## 768 1 93 31 125 30.4 0.315 23 0
## miss_glucose ...
## 1 0 ...
## 2 0 ...
## 3 0 ...
## ... 0 ...
## 768 0 ...
##
## y[768,1]:
## pressure
## 1 72
## 2 66
## 3 64
## ... 66
## 768 70
##
## Skipped forward pass
## Subset size GRSq RSq DeltaGRSq nPreds Terms (col nbr in bx)
## 1 0.0000 0.0000 0.0000 0 1
## 2 0.1153 0.1199 0.1153 1 1 3
## chosen 3 0.1836 0.1921 0.0683 2 1 3 4
## 4 0.1824 0.1952 -0.0012 2 1 3 4 5
## 5 0.1785 0.1955 -0.0039 2 1 2 3 4 5
##
## Prune method "cv" penalty 2: selected 3 of 5 terms, and 2 of 13 preds
## After pruning pass GRSq 0.184 RSq 0.192
# view summary output
summary(earth1, details=TRUE)
## Call: earth(formula=pressure~., data=data, pmethod="cv", keepxy=TRUE,
## trace=4, degree=1, nprune=20, nfold=10, ncross=2, nk=5)
##
## coefficients
## (Intercept) 75.629539
## h(mass-24.2) 0.498705
## h(51-age) -0.402813
##
## Number of cases: 768
## Selected 3 of 5 terms, and 2 of 13 predictors using pmethod="cv"
## Termination condition: Reached nk 5
## Importance: age, mass, pregnant-unused, glucose-unused, ...
## Number of terms at each degree of interaction: 1 2 (additive model)
## GRSq 0.1836131 RSq 0.192106 mean.oof.RSq 0.1691089 (sd 0.0797)
##
## pmethod="backward" would have selected the same model:
## 3 terms 2 preds, GRSq 0.1836131 RSq 0.192106 mean.oof.RSq 0.1691089
# view predictor importance
evimp(earth1)
## nsubsets gcv rss
## age 2 100.0 100.0
## mass 1 61.0 61.3
# compute predicted values
earth_pred <- predict(earth1)
# print accuracy
(mse <- mean((data$pressure - earth_pred)^2))
## [1] 118.0642
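For a rough comparison against the GAMs above, the same in-sample error can be computed for any of the fits. A tiny helper (the name `mse` is just illustrative; remember that in-sample MSE favors more flexible models, so treat it as descriptive only):

```r
# in-sample mean squared error; e.g. mse(data$pressure, predict(gam1))
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse(c(70, 66, 64), c(72, 66, 61))  # toy check: (4 + 0 + 9) / 3
```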
Earth plots:
# plot
# png("earth1.png")
plot(earth1)
# dev.off()
plot(earth1, info=T, type="response", trace=1)
## stats::residuals(object=earth.object, type="response")
## stats::fitted(object=earth.object)
## got model response from object$y
##
## training rsq 0.19
plotmo(earth1, info=T, type="response", trace=1)#, level=.9)
## stats::predict(earth.object, NULL, type="response", info=TRUE)
## Warning: predict.earth ignored argument 'info'
## stats::fitted(object=earth.object)
## got model response from object$y
##
## plotmo grid: pregnant glucose triceps insulin mass pedigree age
## 3 117 29 125 32.3 0.3725 29
## diabetes miss_glucose miss_pressure miss_triceps miss_insulin miss_mass
## 0 0 0 0 0 0
## Warning: predict.earth ignored argument 'info'
## Warning: predict.earth ignored argument 'info'
# 3d MARS plots!
# same syntactical rules apply here as well
plotmo(earth1)
## plotmo grid: pregnant glucose triceps insulin mass pedigree age
## 3 117 29 125 32.3 0.3725 29
## diabetes miss_glucose miss_pressure miss_triceps miss_insulin miss_mass
## 0 0 0 0 0 0
plotmo(earth1, all2=TRUE, persp.col="azure")
## plotmo grid: pregnant glucose triceps insulin mass pedigree age
## 3 117 29 125 32.3 0.3725 29
## diabetes miss_glucose miss_pressure miss_triceps miss_insulin miss_mass
## 0 0 0 0 0 0
We can also see the ideal number of terms:
# note: `control` is defined here but not passed to train() below, so caret
# falls back to its default bootstrap resampling (25 reps); to use it, pass
# trControl = control (with sensible settings, e.g. number = 10)
control <- trainControl(method = "repeatedcv",
                        repeats = 1, number = 1)
grid <- expand.grid(.degree = 1, .nprune = 2:25)
earth_best_terms <- train(pressure ~ ., data = data, method = "earth",
                          tuneGrid = grid)
earth_best_terms
## Multivariate Adaptive Regression Spline
##
## 768 samples
## 13 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 768, 768, 768, 768, 768, 768, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared
## 2 11.35408 0.1073283
## 3 10.89617 0.1793008
## 4 10.93983 0.1760370
## 5 11.02425 0.1692765
## 6 11.04741 0.1662241
## 7 11.12699 0.1582005
## 8 11.14840 0.1586616
## 9 11.18532 0.1575292
## 10 11.23732 0.1538041
## 11 11.26693 0.1516627
## 12 11.29749 0.1494507
## 13 11.30668 0.1499358
## 14 11.38831 0.1487384
## 15 11.42374 0.1469635
## 16 11.44278 0.1454418
## 17 11.44987 0.1448343
## 18 11.44689 0.1451135
## 19 11.44689 0.1451135
## 20 11.44689 0.1451135
## 21 11.44689 0.1451135
## 22 11.44689 0.1451135
## 23 11.44689 0.1451135
## 24 11.44689 0.1451135
## 25 11.44689 0.1451135
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
plot(earth_best_terms)
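If repeated cross-validation is actually wanted instead of caret's default bootstrap, the control object needs sensible fold counts and must be passed through the trControl argument. A sketch (not run here; reuses `data` and `grid` from above):

```r
# 10-fold CV repeated 3 times, passed explicitly to train()
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
earth_cv <- train(pressure ~ ., data = data, method = "earth",
                  tuneGrid = grid, trControl = control)
```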
TODO:
- determine the best value for nfold
- explore the ncross argument
- plot cross-validation results
- collect \(R^2\) in different ways
- use cross-validation to select the number of terms
- better discuss partial dependence plots
- include confidence intervals versus prediction intervals
- investigate the assumptions of prediction intervals
- include text about interpretation of 3D plotmo regression surfaces
- comprehensively discuss limitations
See Stephen Milborrow’s excellent notes on earth here for lots of handy tips and tricks.